Maftei Alexandru, Vasile Catalina
The data set contains booking information for two hotels in Portugal: a resort hotel and a city hotel. Each observation represents one hotel reservation. The data set covers reservations between July 1, 2015 and August 31, 2017, including those that were canceled. Because these are real hotel data, all information that could identify the hotels or the clients has been removed. Since real business data are rarely available for scientific and educational purposes, data sets like this one can play an important role in research and education in revenue management, machine learning, data mining and other fields.
Every year, more than 140 million bookings are made on the Internet, and hotel booking cancellations are a growing problem. An analysis of the last 5 years showed that the cancellation rate has reached almost 40%, a trend that has a very negative impact on hotel revenue and distribution management strategies.
It is therefore useful to understand, and even predict, which guests are more likely to cancel their bookings, by drawing insights from the data set and identifying the features that contribute most to cancellations.
## [1] 4
## [1] 0
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ country : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ agent : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
## $ company : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ reservation_status_date : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
We will try to answer questions such as:
We will also try to find different relationships between the attributes.
Most of the cancellations are made at the city hotel (33,102, compared with 11,122 at the resort hotel).
However, clients still book the city hotel far more often than the resort hotel (79,330 bookings versus 40,060).
##
## City Hotel Resort Hotel
## 79330 40060
##
## City Hotel Resort Hotel
## 0 46228 28938
## 1 33102 11122
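The two tables above can be combined into a cancellation rate per hotel, which makes the comparison independent of how many bookings each hotel receives. The following Python sketch only re-uses the printed counts; the variable names are illustrative and are not taken from the original analysis code.

```python
# Counts copied from the tables above (rows: is_canceled, columns: hotel).
not_canceled = {"City Hotel": 46228, "Resort Hotel": 28938}
canceled = {"City Hotel": 33102, "Resort Hotel": 11122}

cancellation_rate = {}
for hotel in canceled:
    total = canceled[hotel] + not_canceled[hotel]
    # Share of this hotel's bookings that ended up canceled.
    cancellation_rate[hotel] = canceled[hotel] / total
    print(f"{hotel}: {total} bookings, {cancellation_rate[hotel]:.1%} canceled")
```

With these counts the city hotel has a cancellation rate of roughly 42% and the resort hotel of roughly 28%, so the city hotel leads both in absolute and in relative terms.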
## mydataset$adults length(is_canceled)
## 1 0 403
## 2 1 23027
## 3 2 89680
## 4 3 6202
## 5 4 62
## 6 5 2
## 7 6 1
## 8 10 1
## 9 20 2
## 10 26 5
## 11 27 2
## 12 40 1
## 13 50 1
## 14 55 1
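Bookings for two adults are by far the most common (89,680 of 119,390). Note that the table above counts all bookings per number of adults, not only the canceled ones, so by itself it does not show that couples cancel at a higher rate.
Transient guests are more likely to cancel their bookings.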
## mydataset$market_segment length(is_canceled)
## 1 Aviation 237
## 2 Complementary 743
## 3 Corporate 5295
## 4 Direct 12606
## 5 Groups 19811
## 6 Offline TA/TO 24219
## 7 Online TA 56477
## 8 Undefined 2
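The online travel agent segment (Online TA) has the greatest number of cancellations. It is also by far the largest segment, with 56,477 of the bookings counted in the table above.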
## mydataset$reserved_room_type length(is_canceled)
## 1 A 85994
## 2 B 1118
## 3 C 932
## 4 D 19201
## 5 E 6535
## 6 F 2897
## 7 G 2094
## 8 H 601
## 9 L 6
## 10 P 12
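Room type A has the greatest number of cancellations. It is also the most frequently reserved room type, with 85,994 of the bookings counted in the table above.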
## mydataset$arrival_date_month length(is_canceled)
## 1 April 11089
## 2 August 13877
## 3 December 6780
## 4 February 8068
## 5 January 5929
## 6 July 12661
## 7 June 10939
## 8 March 9794
## 9 May 11791
## 10 November 6794
## 11 October 11160
## 12 September 10508
July and August are also the months with the greatest number of cancellations.
## PRT 0.406986
## GBR 0.101591
## FRA 0.087235
## ESP 0.071765
## DEU 0.061035
## Name: country, dtype: float64
[Figure: "Most Popular Countries of Origin of the Guests" (x-axis: Country)]
Guests from almost all over the world choose to spend their holiday in these two hotels. The largest group comes, of course, from Portugal (about 41%), followed by other European countries such as Great Britain (about 10%) and France (about 9%).
## August 0.116233
## July 0.106047
## May 0.098760
## October 0.093475
## April 0.092880
## June 0.091624
## September 0.088014
## March 0.082034
## February 0.067577
## November 0.056906
## December 0.056789
## January 0.049661
## Name: arrival_date_month, dtype: float64
[Figure: "Most Occupied (Busiest) Month with Bookings" (x-axis: Month)]
August is the month with the highest number of bookings, followed by July, while January is the least busy month.
## 0 0.629584
## 1 0.370416
## Name: is_canceled, dtype: float64
[Figure: pie chart "Proportion of Cancelled & Not Cancelled Bookings" (Not Cancelled: 63.0%, Cancelled: 37.0%)]
This pie chart shows the proportion of cancelled and not cancelled bookings. 37% of the bookings were cancelled, which is a high percentage and suggests that some measures should be taken.
## Online TA 0.473046
## Offline TA/TO 0.202856
## Groups 0.165935
## Direct 0.105587
## Corporate 0.044350
## Complementary 0.006223
## Aviation 0.001985
## Undefined 0.000017
## Name: market_segment, dtype: float64
[Figure: "Total Number of Bookings by Market Segment" (x-axis: Market Segment)]
The largest share of the bookings (about 47%) is made through online travel agents, while only about 11% are made directly by guests.
## Transient 0.750591
## Transient-Party 0.210436
## Contract 0.034140
## Group 0.004833
## Name: customer_type, dtype: float64
[Figure: "Total Number of Bookings by Customer Type" (x-axis labelled "Market Segment" in the plot)]
This plot shows that 75% of the bookings are transient bookings, 21% are transient-party bookings and about 3% are contract bookings.
## 0 0.588977
## 1 0.278298
## 2 0.108627
## 3 0.020915
## 4 0.002848
## 5 0.000335
## Name: total_of_special_requests, dtype: float64
[Figure: "Total Special Request" (x-axis: Number of Special Request)]
Almost 60% of the bookings come with no special requests.
[Figure: "Room price per night and person over the year" (x-axis: Arrival Month, y-axis: ADR [EUR])]
For the resort hotel, the price per night is highest in July, August and September, while for the city hotel the highest prices are in March, April and May.
Guests who have booked before (repeated guests) are less likely to cancel their bookings.
## No Deposit 0.876464
## Non Refund 0.122179
## Refundable 0.001357
## Name: deposit_type, dtype: float64
[Figure: pie chart "Proportion of Total Bookings by Deposit Type" (No Deposit: 87.6%, Non Refundable: 12.2%, Refundable: 0.1%)]
The majority of the bookings are made without deposit.
## deposit_type is_canceled
## No Deposit 0 0.716230
## 1 0.283770
## Non Refund 1 0.993624
## 0 0.006376
## Refundable 0 0.777778
## 1 0.222222
## Name: is_canceled, dtype: float64
[Figure: "Effect of Deposit Type on Cancellations" (x-axis: Deposit Type)]
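Around 28% of the bookings with no deposit and 22% of the bookings with a refundable deposit were cancelled. Surprisingly, more than 99% of the bookings with a non-refundable deposit were cancelled, so in this data set the absence of a deposit does not by itself go together with a higher cancellation rate.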
## meal is_canceled
## BB 0 0.626151
## 1 0.373849
## FB 1 0.598997
## 0 0.401003
## HB 0 0.655397
## 1 0.344603
## SC 0 0.627606
## 1 0.372394
## Undefined 0 0.755346
## 1 0.244654
## Name: is_canceled, dtype: float64
[Figure: "Effect of Meal type on Cancellations" (x-axis: Meal type)]
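The bed and breakfast (BB) meal plan is the one most preferred by guests, so it also has the highest number of cancellations. In relative terms, however, full board (FB) bookings have the highest cancellation rate (about 60%), while the rates for BB, HB and SC lie between 34% and 37%.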
## required_car_parking_spaces is_canceled
## 0 0 0.605051
## 1 0.394949
## 1 0 1.000000
## 2 0 1.000000
## 3 0 1.000000
## 8 0 1.000000
## Name: is_canceled, dtype: float64
[Figure: "Effect of Car Parking Space on Cancellations" (x-axis: Number of Car Parking Spaces)]
Almost 40% of the bookings made by guests who did not ask for a parking space were cancelled, whereas none of the bookings with at least one requested parking space were cancelled.
[Figure: "Arrival Year vs Lead Time By Cancellation Status" (x-axis: Arrival Year, y-axis: Lead Time)]
Bookings with lead time less than 100 days are less likely to be cancelled.
We will start by computing the correlations between each pair of numerical variables. These correlations will be represented in a correlation matrix in order to get an idea of which variables change together.
## is_canceled days_in_waiting_list
## is_canceled 1.000000000 0.054185824
## days_in_waiting_list 0.054185824 1.000000000
## required_car_parking_spaces -0.195497817 -0.030600046
## is_repeated_guest -0.084793418 -0.022234965
## previous_bookings_not_canceled -0.057357723 -0.009396978
## booking_changes -0.144380991 -0.011633945
## previous_cancellations 0.110132808 0.005928941
## lead_time 0.293123356 0.170084184
## total_of_special_requests -0.234657774 -0.082729719
## adr 0.047556598 -0.040756412
## stays_in_week_nights 0.024764629 -0.002019810
## stays_in_weekend_nights -0.001791078 -0.054151113
## adults 0.060017213 -0.008283347
## children 0.005036255 -0.033271416
## babies -0.032491089 -0.010620543
## required_car_parking_spaces is_repeated_guest
## is_canceled -0.19549782 -0.084793418
## days_in_waiting_list -0.03060005 -0.022234965
## required_car_parking_spaces 1.00000000 0.077089573
## is_repeated_guest 0.07708957 1.000000000
## previous_bookings_not_canceled 0.04765309 0.418055995
## booking_changes 0.06562019 0.012091787
## previous_cancellations -0.01849225 0.082293234
## lead_time -0.11645057 -0.124409908
## total_of_special_requests 0.08262634 0.013050009
## adr 0.05662809 -0.134314447
## stays_in_week_nights -0.02485942 -0.097244972
## stays_in_weekend_nights -0.01855381 -0.087239379
## adults 0.01478482 -0.146426116
## children 0.05625495 -0.032857741
## babies 0.03738336 -0.008942634
## previous_bookings_not_canceled booking_changes
## is_canceled -0.057357723 -0.1443809911
## days_in_waiting_list -0.009396978 -0.0116339446
## required_car_parking_spaces 0.047653087 0.0656201914
## is_repeated_guest 0.418055995 0.0120917873
## previous_bookings_not_canceled 1.000000000 0.0116075289
## booking_changes 0.011607529 1.0000000000
## previous_cancellations 0.152728115 -0.0269926626
## lead_time -0.073548168 0.0001488301
## total_of_special_requests 0.037823776 0.0528334357
## adr -0.072144196 0.0196176738
## stays_in_week_nights -0.048742550 0.0962094460
## stays_in_weekend_nights -0.042715235 0.0632813159
## adults -0.107983172 -0.0516727735
## children -0.021071664 0.0489516990
## babies -0.006550454 0.0834397814
## previous_cancellations lead_time
## is_canceled 0.110132808 0.2931233558
## days_in_waiting_list 0.005928941 0.1700841843
## required_car_parking_spaces -0.018492250 -0.1164505701
## is_repeated_guest 0.082293234 -0.1244099080
## previous_bookings_not_canceled 0.152728115 -0.0735481679
## booking_changes -0.026992663 0.0001488301
## previous_cancellations 1.000000000 0.0860418019
## lead_time 0.086041802 1.0000000000
## total_of_special_requests -0.048384118 -0.0957120489
## adr -0.065645638 -0.0630768525
## stays_in_week_nights -0.013992431 0.1657993639
## stays_in_weekend_nights -0.012774619 0.0856711329
## adults -0.006738096 0.1195186926
## children -0.024729166 -0.0376128161
## babies -0.007500998 -0.0209150163
## total_of_special_requests adr
## is_canceled -0.23465777 0.04755660
## days_in_waiting_list -0.08272972 -0.04075641
## required_car_parking_spaces 0.08262634 0.05662809
## is_repeated_guest 0.01305001 -0.13431445
## previous_bookings_not_canceled 0.03782378 -0.07214420
## booking_changes 0.05283344 0.01961767
## previous_cancellations -0.04838412 -0.06564564
## lead_time -0.09571205 -0.06307685
## total_of_special_requests 1.00000000 0.17218526
## adr 0.17218526 1.00000000
## stays_in_week_nights 0.06819178 0.06523748
## stays_in_weekend_nights 0.07267083 0.04934191
## adults 0.12288355 0.23064122
## children 0.08173584 0.32485303
## babies 0.09788879 0.02918569
## stays_in_week_nights stays_in_weekend_nights
## is_canceled 0.02476463 -0.001791078
## days_in_waiting_list -0.00201981 -0.054151113
## required_car_parking_spaces -0.02485942 -0.018553809
## is_repeated_guest -0.09724497 -0.087239379
## previous_bookings_not_canceled -0.04874255 -0.042715235
## booking_changes 0.09620945 0.063281316
## previous_cancellations -0.01399243 -0.012774619
## lead_time 0.16579936 0.085671133
## total_of_special_requests 0.06819178 0.072670830
## adr 0.06523748 0.049341906
## stays_in_week_nights 1.00000000 0.498968818
## stays_in_weekend_nights 0.49896882 1.000000000
## adults 0.09297551 0.091871020
## children 0.04420335 0.045793885
## babies 0.02019097 0.018482810
## adults children babies
## is_canceled 0.060017213 0.005036255 -0.032491089
## days_in_waiting_list -0.008283347 -0.033271416 -0.010620543
## required_car_parking_spaces 0.014784817 0.056254947 0.037383356
## is_repeated_guest -0.146426116 -0.032857741 -0.008942634
## previous_bookings_not_canceled -0.107983172 -0.021071664 -0.006550454
## booking_changes -0.051672774 0.048951699 0.083439781
## previous_cancellations -0.006738096 -0.024729166 -0.007500998
## lead_time 0.119518693 -0.037612816 -0.020915016
## total_of_special_requests 0.122883546 0.081735841 0.097888792
## adr 0.230641216 0.324853030 0.029185690
## stays_in_week_nights 0.092975513 0.044203353 0.020190974
## stays_in_weekend_nights 0.091871020 0.045793885 0.018482810
## adults 1.000000000 0.030440359 0.018145642
## children 0.030440359 1.000000000 0.024030235
## babies 0.018145642 0.024030235 1.000000000
The blue points suggest a positive correlation, while the red ones suggest a negative one. The bigger the point, the stronger the correlation.
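We can observe both negative and positive correlations between the variables, but most of them are weak. The strongest correlations with is_canceled are those of lead_time (0.29) and total_of_special_requests (-0.23).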
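For readers who want to see what each entry of the correlation matrix means, the following Python sketch computes a Pearson correlation coefficient by hand. The function name `pearson` and the toy values are made up for illustration and are not taken from the data set.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy values (hypothetical): five bookings with their lead time and outcome.
lead_time = [5, 20, 60, 150, 300]
is_canceled = [0, 0, 1, 0, 1]
r = pearson(lead_time, is_canceled)
print(f"correlation between lead_time and is_canceled: {r:.2f}")
```

The result always lies between -1 and 1; values near 0, like most of those in the matrix above, indicate a weak linear relationship.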
To build the logistic regression model we will use the glm function. The response is is_canceled and the predictors are a selection of the other variables.
We explicitly request a logistic regression by setting the family argument to “binomial”.
##
## Call:
## glm(formula = is_canceled ~ lead_time + customer_type + hotel +
## deposit_type + adr + total_of_special_requests, family = "binomial",
## data = mydataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9767 -0.8087 -0.5849 0.2038 2.7645
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8814713 0.0475111 -39.601 < 2e-16 ***
## lead_time 0.0048474 0.0000781 62.066 < 2e-16 ***
## customer_typeGroup -0.6239389 0.1473791 -4.234 2.30e-05 ***
## customer_typeTransient 0.6274113 0.0445352 14.088 < 2e-16 ***
## customer_typeTransient-Party -0.2813941 0.0466103 -6.037 1.57e-09 ***
## hotelResort Hotel -0.3099591 0.0152919 -20.270 < 2e-16 ***
## deposit_typeNon Refund 5.1679826 0.1047350 49.343 < 2e-16 ***
## deposit_typeRefundable -0.0388890 0.1961689 -0.198 0.843
## adr 0.0048739 0.0001476 33.031 < 2e-16 ***
## total_of_special_requests -0.5470358 0.0101321 -53.990 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 157398 on 119389 degrees of freedom
## Residual deviance: 117225 on 119380 degrees of freedom
## AIC: 117245
##
## Number of Fisher Scoring iterations: 7
The summary returns the estimates, standard errors, z-values and p-values for each of the coefficients. According to the results, all coefficients are statistically significant, with one exception: deposit_typeRefundable (p = 0.843).
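To make the link between the coefficients and a predicted probability concrete, the following Python sketch applies the logistic function to the estimates printed in the summary. It is a hand calculation, not the code used in the analysis; the helper `cancel_probability` is a hypothetical name, and it only covers a transient, no-deposit booking at the resort hotel, which is the profile of the first rows of the data set.

```python
from math import exp

# Estimates copied from the summary above. The reference levels (Contract,
# City Hotel, No Deposit) have an implicit coefficient of 0.
coef = {
    "(Intercept)": -1.8814713,
    "lead_time": 0.0048474,
    "customer_typeTransient": 0.6274113,
    "hotelResort Hotel": -0.3099591,
    "adr": 0.0048739,
    "total_of_special_requests": -0.5470358,
}

def cancel_probability(lead_time, adr, special_requests):
    """Cancellation probability for a transient, no-deposit resort booking."""
    log_odds = (
        coef["(Intercept)"]
        + coef["customer_typeTransient"]
        + coef["hotelResort Hotel"]
        + coef["lead_time"] * lead_time
        + coef["adr"] * adr
        + coef["total_of_special_requests"] * special_requests
    )
    # The logistic (inverse logit) function maps log-odds to a probability.
    return 1.0 / (1.0 + exp(-log_odds))

# First booking in the data set: lead_time 342, adr 0, no special requests.
p_first = cancel_probability(lead_time=342, adr=0, special_requests=0)
print(round(p_first, 4))
```

Up to rounding of the printed coefficients, this gives a probability of about 0.52 for the first booking, and the same formula with a lead time of 737 days gives about 0.88.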
The next step is to make predictions.
## [1] 0.5234317 0.8816891 0.2378428 0.2431552 0.1728522 0.1728522
We get an array of probabilities. The first two are the largest of those shown, approximately 52% and 88%. We will then predict whether each booking will be canceled or not, based on the variables chosen as predictors, and compute the accuracy of the model. To do this, we will use the ifelse command with a mean as the threshold (the value printed below, which equals the overall share of canceled bookings).
## [1] 0.3704163
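The thresholding step can be sketched in Python as follows, using the six probabilities and the threshold printed above. Whether the original ifelse call uses a strict or a non-strict comparison is an assumption here; for these values it makes no difference.

```python
# First six predicted probabilities and the threshold, copied from the output above.
probabilities = [0.5234317, 0.8816891, 0.2378428, 0.2431552, 0.1728522, 0.1728522]
threshold = 0.3704163

# A booking is predicted as canceled (1) when its probability exceeds the threshold.
predictions = [1 if p > threshold else 0 for p in probabilities]
print(predictions)
```

Only the first two bookings exceed the threshold, so they are the only ones of the six predicted as canceled.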
Confusion matrix
TP = true positive, TN = true negative, FP = false positive, FN = false negative
##
## 0 1
## 0 62233 17077
## 1 12933 27147
## [1] 0.7486389
The accuracy of the model is almost 75%. Reading the rows of the confusion matrix as predictions and the columns as actual values, the model correctly predicted that 62,233 bookings would not be canceled (TN) but missed 17,077 bookings that were in fact canceled (FN). Likewise, it wrongly flagged 12,933 bookings as canceled (FP) and correctly classified 27,147 bookings as canceled (TP).
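The accuracy, and two further metrics that are often reported next to it, can be recomputed from the four counts of the confusion matrix. The sketch below assumes that the rows are predictions and the columns are actual values, which is consistent with the column totals (75,166 not canceled and 44,224 canceled bookings); precision and recall are additions for illustration and do not appear in the original analysis.

```python
# Counts from the confusion matrix above (rows: predicted, columns: actual).
tn, fn = 62233, 17077   # predicted 0: actually 0 / actually 1
fp, tp = 12933, 27147   # predicted 1: actually 0 / actually 1

total = tn + fn + fp + tp
accuracy = (tp + tn) / total   # share of all bookings classified correctly
precision = tp / (tp + fp)     # share of predicted cancellations that were real
recall = tp / (tp + fn)        # share of real cancellations that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Under this reading the model finds roughly 61% of the actual cancellations, and roughly 68% of the bookings it flags are really canceled.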
As a last step, we will analyse the ROC curve and the AUC, which are used to assess the performance of a classification model. ROC = receiver operating characteristic, AUC = area under the (ROC) curve.
The AUC tells us how well the model distinguishes between the two classes. The higher it is, the better the model separates bookings that will be canceled (1) from those that will not (0).
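One way to read the AUC is as the probability that a randomly chosen canceled booking receives a higher predicted score than a randomly chosen non-canceled one. The Python sketch below computes it directly from that definition on a small made-up example; the function name `auc` and the toy labels and scores are hypothetical, and the pairwise loop is meant for illustration rather than for a data set of this size.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy example (hypothetical): two canceled and two non-canceled bookings.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
```

A value of 0.5 corresponds to random guessing and a value of 1 to a perfect separation of the two classes.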
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
The AUC, the area under the ROC curve, is computed as follows:
## Area under the curve: 0.7209
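An AUC of 0.72 is clearly better than random guessing (0.5) but still well below 1, so the model has an acceptable, though not excellent, ability to distinguish canceled from non-canceled bookings.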
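We will use the PCA (principal component analysis) algorithm to reduce the dimension of our data. It is applied to the numeric columns shown below, which also include the predict and predictions columns produced by the logistic regression.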
## lead_time arrival_date_year arrival_date_week_number
## 1 342 2015 27
## 2 737 2015 27
## 3 7 2015 27
## 4 13 2015 27
## 5 14 2015 27
## 6 14 2015 27
## arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults
## 1 1 0 0 2
## 2 1 0 0 2
## 3 1 0 1 1
## 4 1 0 1 1
## 5 1 0 2 2
## 6 1 0 2 2
## children babies is_repeated_guest previous_cancellations
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## previous_bookings_not_canceled booking_changes days_in_waiting_list adr
## 1 0 3 0 0
## 2 0 4 0 0
## 3 0 0 0 75
## 4 0 0 0 75
## 5 0 0 0 98
## 6 0 0 0 98
## required_car_parking_spaces total_of_special_requests predict predictions
## 1 0 0 0.5234317 1
## 2 0 0 0.8816891 1
## 3 0 0 0.2378428 0
## 4 0 0 0.2431552 0
## 5 0 1 0.1728522 0
## 6 0 1 0.1728522 0
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6172 1.3813 1.26126 1.1874 1.15730 1.04284 1.01679
## Proportion of Variance 0.1376 0.1004 0.08372 0.0742 0.07049 0.05724 0.05441
## Cumulative Proportion 0.1376 0.2381 0.32180 0.3960 0.46649 0.52373 0.57815
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.00615 0.97687 0.96751 0.9399 0.93127 0.8707 0.81504
## Proportion of Variance 0.05328 0.05022 0.04927 0.0465 0.04565 0.0399 0.03496
## Cumulative Proportion 0.63143 0.68165 0.73092 0.7774 0.82306 0.8630 0.89793
## PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.75140 0.7029 0.61136 0.54979 0.45251
## Proportion of Variance 0.02972 0.0260 0.01967 0.01591 0.01078
## Cumulative Proportion 0.92764 0.9536 0.97331 0.98922 1.00000
PC1, PC2 and PC3 have the largest proportions of variance (13.8%, 10.0% and 8.4%). Together they explain about 32% of the total variance.
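The "Proportion of Variance" and "Cumulative Proportion" rows follow directly from the standard deviations of the components: each variance is the squared standard deviation, divided by the sum of all variances. The Python sketch below reproduces this from the printed values; the variable names are illustrative, and small differences from the table are due to rounding.

```python
# Standard deviations of the 19 principal components, copied from the output above.
sdev = [1.6172, 1.3813, 1.26126, 1.1874, 1.15730, 1.04284, 1.01679,
        1.00615, 0.97687, 0.96751, 0.9399, 0.93127, 0.8707, 0.81504,
        0.75140, 0.7029, 0.61136, 0.54979, 0.45251]

variances = [s ** 2 for s in sdev]
total_variance = sum(variances)
proportion = [v / total_variance for v in variances]

# Running sum of the proportions gives the cumulative proportion.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

print(f"PC1 to PC3 explain {cumulative[2]:.1%} of the variance")

# Smallest number of components whose cumulative proportion reaches 80%.
components_for_80 = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.80)
print(f"components needed for 80%: {components_for_80}")
```

The same calculation shows how many components would be needed for a given share of the variance, which is useful when judging how much information is lost by keeping only the first three.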
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## 1 -1.715752 -1.6041982 -0.9598965 0.5578969 0.369869042 2.5424196 0.5917407
## 2 -3.968765 -1.4271937 -1.1697921 1.4656700 0.004241446 3.7990130 1.4356120
## 3 1.057518 -1.9002481 -0.7015852 -0.7152206 0.911370494 -0.1746979 -0.8227414
## 4 1.021634 -1.9003218 -0.7037807 -0.7054257 0.907879305 -0.1702983 -0.8091639
## 5 1.298097 -0.5787277 -0.8997913 -0.8363931 0.429996163 -1.0175978 0.3036304
## 6 1.298097 -0.5787277 -0.8997913 -0.8363931 0.429996163 -1.0175978 0.3036304
## PC8 PC9 PC10 PC11 PC12 PC13
## 1 -2.322707 -0.5698578 0.01176266 -2.0810442 -3.0386984 0.33153092
## 2 -2.385715 -0.6223914 0.13305467 -2.9581197 -4.5118811 -0.98868782
## 3 -1.680015 -0.5546796 -0.79908238 0.3396690 0.1298267 -0.04607736
## 4 -1.676955 -0.5555551 -0.79701853 0.3379963 0.1177325 -0.07156416
## 5 -1.309910 -0.7148110 -0.63236473 0.1428980 -0.3949481 0.27986457
## 6 -1.309910 -0.7148110 -0.63236473 0.1428980 -0.3949481 0.27986457
## PC14 PC15 PC16 PC17 PC18 PC19
## 1 -1.80919683 0.5568728 -0.6000674 0.7629200 -0.3368832 -0.79694361
## 2 -2.73562692 0.6690934 -0.8049877 0.4787150 -2.1372014 -0.06477837
## 3 0.33140496 0.2452882 0.2160574 0.4865680 -0.4854662 -0.16119232
## 4 0.31731314 0.2470322 0.2141921 0.4803561 -0.5137804 -0.15123404
## 5 -0.02012024 0.1633696 0.5562939 0.8581534 -0.2581641 -0.21462075
## 6 -0.02012024 0.1633696 0.5562939 0.8581534 -0.2581641 -0.21462075
## PC1 PC2 PC3
## 1 -1.715752 -1.6041982 -0.9598965
## 2 -3.968765 -1.4271937 -1.1697921
## 3 1.057518 -1.9002481 -0.7015852
## 4 1.021634 -1.9003218 -0.7037807
## 5 1.298097 -0.5787277 -0.8997913
## 6 1.298097 -0.5787277 -0.8997913
## hotel arrival_date_month meal country market_segment
## 1 Resort Hotel July BB PRT Direct
## 2 Resort Hotel July BB PRT Direct
## 3 Resort Hotel July BB GBR Direct
## 4 Resort Hotel July BB GBR Corporate
## 5 Resort Hotel July BB GBR Online TA
## 6 Resort Hotel July BB GBR Online TA
## distribution_channel reserved_room_type assigned_room_type deposit_type agent
## 1 Direct C C No Deposit NULL
## 2 Direct C C No Deposit NULL
## 3 Direct A C No Deposit NULL
## 4 Corporate A A No Deposit 304
## 5 TA/TO A A No Deposit 240
## 6 TA/TO A A No Deposit 240
## company customer_type reservation_status reservation_status_date
## 1 NULL Transient Check-Out 2015-07-01
## 2 NULL Transient Check-Out 2015-07-01
## 3 NULL Transient Check-Out 2015-07-02
## 4 NULL Transient Check-Out 2015-07-02
## 5 NULL Transient Check-Out 2015-07-03
## 6 NULL Transient Check-Out 2015-07-03
Since the data set is imbalanced, we will use ensemble methods, which can help to limit overfitting. We do not need reservation_status_date, company (94% missing values) or the agent ID. The randomForest implementation in R does not handle categorical variables with more than 53 levels, and since the country feature has 178 levels, one-hot encoding all of its values would lead to the curse of dimensionality.
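One common way to deal with a categorical variable that has too many levels is to keep only the most frequent levels, lump the rest into a single "Other" level and one-hot encode the result. The Python sketch below illustrates that idea on a handful of made-up country codes; it is not necessarily the approach taken in this analysis, and the helper names `lump_rare_levels` and `one_hot` are hypothetical.

```python
from collections import Counter

def lump_rare_levels(values, keep=3, other="Other"):
    """Keep the `keep` most frequent levels and replace all others by `other`."""
    top = {level for level, _ in Counter(values).most_common(keep)}
    return [v if v in top else other for v in values]

def one_hot(values):
    """Return the sorted levels and one 0/1 indicator row per value."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Toy country codes (hypothetical sample, not the real distribution).
countries = ["PRT", "PRT", "GBR", "FRA", "PRT", "ESP", "GBR", "DEU", "FRA", "BRA"]
lumped = lump_rare_levels(countries, keep=3)
levels, rows = one_hot(lumped)

print(levels)
for value, row in zip(lumped, rows):
    print(value, row)
```

With this kind of grouping the number of indicator columns is bounded by the chosen number of levels plus one, instead of growing with every rare country code.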
We combine the response variable and the categorical variables with the principal components created above.
## is_canceled hotel arrival_date_month meal country market_segment
## 1 0 Resort Hotel July BB PRT Direct
## 2 0 Resort Hotel July BB PRT Direct
## 3 0 Resort Hotel July BB GBR Direct
## 4 0 Resort Hotel July BB GBR Corporate
## 5 0 Resort Hotel July BB GBR Online TA
## 6 0 Resort Hotel July BB GBR Online TA
## distribution_channel reserved_room_type assigned_room_type deposit_type agent
## 1 Direct C C No Deposit NULL
## 2 Direct C C No Deposit NULL
## 3 Direct A C No Deposit NULL
## 4 Corporate A A No Deposit 304
## 5 TA/TO A A No Deposit 240
## 6 TA/TO A A No Deposit 240
## company customer_type reservation_status reservation_status_date PC1
## 1 NULL Transient Check-Out 2015-07-01 -1.715752
## 2 NULL Transient Check-Out 2015-07-01 -3.968765
## 3 NULL Transient Check-Out 2015-07-02 1.057518
## 4 NULL Transient Check-Out 2015-07-02 1.021634
## 5 NULL Transient Check-Out 2015-07-03 1.298097
## 6 NULL Transient Check-Out 2015-07-03 1.298097
## PC2 PC3
## 1 -1.6041982 -0.9598965
## 2 -1.4271937 -1.1697921
## 3 -1.9002481 -0.7015852
## 4 -1.9003218 -0.7037807
## 5 -0.5787277 -0.8997913
## 6 -0.5787277 -0.8997913
## 'data.frame': 119390 obs. of 18 variables:
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ country : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ agent : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
## $ company : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ reservation_status_date: Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
## $ PC1 : num -1.72 -3.97 1.06 1.02 1.3 ...
## $ PC2 : num -1.604 -1.427 -1.9 -1.9 -0.579 ...
## $ PC3 : num -0.96 -1.17 -0.702 -0.704 -0.9 ...
## is_canceled hotel arrival_date_month meal country market_segment
## 1 0 Resort Hotel July BB PRT Direct
## 2 0 Resort Hotel July BB PRT Direct
## 3 0 Resort Hotel July BB GBR Direct
## 4 0 Resort Hotel July BB GBR Corporate
## 5 0 Resort Hotel July BB GBR Online TA
## 6 0 Resort Hotel July BB GBR Online TA
## distribution_channel reserved_room_type assigned_room_type deposit_type
## 1 Direct C C No Deposit
## 2 Direct C C No Deposit
## 3 Direct A C No Deposit
## 4 Corporate A A No Deposit
## 5 TA/TO A A No Deposit
## 6 TA/TO A A No Deposit
## customer_type reservation_status PC1 PC2 PC3
## 1 Transient Check-Out -1.715752 -1.6041982 -0.9598965
## 2 Transient Check-Out -3.968765 -1.4271937 -1.1697921
## 3 Transient Check-Out 1.057518 -1.9002481 -0.7015852
## 4 Transient Check-Out 1.021634 -1.9003218 -0.7037807
## 5 Transient Check-Out 1.298097 -0.5787277 -0.8997913
## 6 Transient Check-Out 1.298097 -0.5787277 -0.8997913
## Head of the 178 country dummy columns (country.ABW through country.ZWE), condensed:
## rows 1-2: country.PRT = 1, all other columns 0
## rows 3-6: country.GBR = 1, all other columns 0
We apply PCA to the country dummy variables, because one-hot encoding produces too many columns to use directly.
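A minimal sketch of this step, assuming the dummy columns come from one-hot encoding the country factor and that only the first two principal components are kept (as the PC1/PC2 output suggests). The toy data frame, the `model.matrix()` encoding, and centering without scaling are assumptions made for illustration, not the original code.

```r
# Toy stand-in for the country column (hypothetical values)
toy <- data.frame(
  country = factor(c("PRT", "PRT", "GBR", "GBR", "ESP", "FRA", "PRT", "GBR"))
)

# One-hot encode: one 0/1 column per factor level, no intercept
dummies <- model.matrix(~ country - 1, data = toy)

# PCA on the dummy matrix; centering only (an assumption)
pca <- prcomp(dummies, center = TRUE, scale. = FALSE)

# Keep the first two component scores as replacement features
scores <- pca$x[, 1:2]
print(head(scores))
```

Because every row from the same country has the same dummy vector, those rows also share the same scores, which is the pattern visible in the PC1/PC2 output for the Portuguese and British bookings.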
## PC1 PC2
## 1 -1.440801 0.03290349
## 2 -1.440801 0.03290349
## 3 1.150869 2.73906746
## 4 1.150869 2.73906746
## 5 1.150869 2.73906746
## 6 1.150869 2.73906746
## 'data.frame': 119390 obs. of 16 variables:
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel: Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ PC1 : num -1.72 -3.97 1.06 1.02 1.3 ...
## $ PC2 : num -1.604 -1.427 -1.9 -1.9 -0.579 ...
## $ PC3 : num -0.96 -1.17 -0.702 -0.704 -0.9 ...
## $ PC1.1 : num -1.44 -1.44 1.15 1.15 1.15 ...
## $ PC2.1 : num 0.0329 0.0329 2.7391 2.7391 2.7391 ...
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52680 0
## 1 0 30880
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.6304
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6304
## Detection Rate : 0.6304
## Detection Prevalence : 0.6304
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22486 0
## 1 0 13344
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.6276
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6276
## Detection Rate : 0.6276
## Detection Prevalence : 0.6276
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
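The headline figures in the two summaries above can be recomputed directly from the cell counts. The sketch below uses only the counts printed in the output; the variable names are illustrative, and the layout of the summary is assumed to follow caret's `confusionMatrix()` conventions (rows are predictions, columns are reference labels).

```r
# Test-set confusion matrix, counts copied from the output above
cm <- matrix(c(22486, 0,
               0, 13344),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Accuracy: share of observations on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# No Information Rate: share of the largest reference class
nir <- max(colSums(cm)) / sum(cm)

# Share of all 119390 bookings that fall in the first (training) matrix
train_share <- (52680 + 30880) / (52680 + 30880 + sum(cm))

print(c(accuracy = accuracy, nir = round(nir, 4),
        train_share = round(train_share, 2)))
```

The No Information Rate is the accuracy of always predicting the majority class (not canceled), so it is the baseline against which the reported accuracy is compared. The two matrices together cover all 119390 bookings, split roughly 70/30.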
We drop the country variable.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22486 0
## 1 0 13344
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.6276
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6276
## Detection Rate : 0.6276
## Detection Prevalence : 0.6276
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
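An accuracy of exactly 1 on both training and test data, with and without the country components, deserves a check for target leakage. One candidate, stated here as an assumption because the modelling code is not shown, is `reservation_status`, which is still present in the data frame structure above: a booking marked Canceled or No-Show has `is_canceled` equal to 1, and a Check-Out has 0. The sketch below shows, on made-up rows, how a cross-tabulation reveals such a predictor.

```r
# Hypothetical rows imitating the two columns of interest
toy <- data.frame(
  is_canceled = c(0, 0, 1, 1, 0, 1),
  reservation_status = factor(c("Check-Out", "Check-Out", "Canceled",
                                "No-Show", "Check-Out", "Canceled"))
)

# Cross-tabulate the candidate predictor against the target
tab <- table(toy$reservation_status, toy$is_canceled)
print(tab)

# If every level maps to exactly one outcome, the predictor
# determines the target and should be excluded from the features
leaks <- all(rowSums(tab > 0) == 1)
print(leaks)
```

Running the same cross-tabulation on the real data would show whether the perfect scores come from this column; if they do, refitting without it gives a more realistic estimate of predictive performance.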
##
## Augmented Dickey-Fuller Test
##
## data: hs
## Dickey-Fuller = -3.2251, Lag order = 2, p-value = 0.1057
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: hs
## KPSS Level = 0.42882, Truncation lag parameter = 2, p-value = 0.06473
##
## ARIMA(2,1,2)(0,1,0)[12] : Inf
## ARIMA(0,1,0)(0,1,0)[12] : 195.3849
## ARIMA(1,1,0)(0,1,0)[12] : 198.1213
## ARIMA(0,1,1)(0,1,0)[12] : 198.1003
## ARIMA(1,1,1)(0,1,0)[12] : 201.5643
##
## Best model: ARIMA(0,1,0)(0,1,0)[12]
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0)(0,1,0)[12]
## Q* = 4.6398, df = 5, p-value = 0.4614
##
## Model df: 0. Total lags used: 5
## ME RMSE MAE MPE MAPE MASE
## Training set -72.98426 286.6851 164.4508 -2.558678 5.461286 0.3415385
## ACF1
## Training set 0.02412717
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 2017 3205 2685.4155 3724.585 2410.3641 3999.636
## Oct 2017 3551 2816.1965 4285.803 2427.2151 4674.785
## Nov 2017 2909 2009.0532 3808.947 1532.6502 4285.350
## Dec 2017 2090 1050.8310 3129.169 500.7281 3679.272
## Jan 2018 2492 1330.1737 3653.826 715.1400 4268.860
## Feb 2018 2562 1289.2831 3834.717 615.5474 4508.453
## Mar 2018 3073 1698.3086 4447.691 970.5909 5175.409
## Apr 2018 3039 1569.3931 4508.607 791.4302 5286.570
## May 2018 3430 1871.2465 4988.754 1046.0922 5813.908
## Jun 2018 3055 1411.9295 4698.070 542.1405 5567.859
## Jul 2018 3193 1469.7331 4916.267 557.4908 5828.509
## Aug 2018 2954 1154.1065 4753.894 201.3004 5706.700
## Sep 2018 3062 983.6620 5140.338 -116.5437 6240.544
## Oct 2018 3408 1084.3474 5731.653 -145.7199 6961.720
## Nov 2018 2766 220.5661 5311.434 -1126.9052 6658.905
## Dec 2018 1947 -802.3828 4696.383 -2257.8181 6151.818
## Jan 2019 2349 -590.2139 5288.214 -2146.1397 6844.140
## Feb 2019 2419 -698.5071 5536.507 -2348.8156 7186.816
## Mar 2019 2930 -356.1410 6216.141 -2095.7189 7955.719
## Apr 2019 2896 -550.5337 6342.534 -2375.0185 8167.018
## May 2019 3287 -312.7871 6886.787 -2218.3993 8792.399
## Jun 2019 2912 -834.7772 6658.777 -2818.2012 8642.201
## Jul 2019 3050 -838.2145 6938.214 -2896.5108 8996.511
## Aug 2019 2811 -1213.6843 6835.684 -3344.2235 8966.224
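The selected model, ARIMA(0,1,0)(0,1,0)[12], has no estimated coefficients, so its point forecasts follow a simple recursion: each value equals the previous value plus the value twelve months earlier minus the value thirteen months earlier. The sketch below implements that recursion in base R on a made-up series; the function name and the toy data are illustrative assumptions, and the prediction intervals are not reproduced.

```r
# Point forecasts for ARIMA(0,1,0)(0,1,0)[m] without a constant:
#   y[t] = y[t-1] + y[t-m] - y[t-m-1]
sarima_010_010_forecast <- function(y, h, m = 12) {
  stopifnot(length(y) >= m + 1, h >= 1)
  n <- length(y)
  ext <- c(as.numeric(y), numeric(h))   # observed values followed by slots for forecasts
  for (k in seq_len(h)) {
    idx <- n + k
    ext[idx] <- ext[idx - 1] + ext[idx - m] - ext[idx - m - 1]
  }
  ext[(n + 1):(n + h)]
}

# Toy monthly series of 26 observations (made-up values)
set.seed(1)
toy <- 3000 + 300 * sin(2 * pi * (1:26) / 12) + cumsum(rnorm(26, sd = 50))

fc <- sarima_010_010_forecast(toy, h = 24)
print(round(fc))
```

The same relation can be read off the forecast table above: Oct 2018 = 3062 + 3551 - 3205 = 3408, that is, Sep 2018 plus Oct 2017 minus Sep 2017. It also explains why every second-year forecast is the first-year value shifted by the same amount.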
## xhat level trend season
## Jul 2016 2728.014 2725.259 38.14394 -35.38889
## Aug 2016 3151.058 2895.019 169.76059 86.27778
## Sep 2016 3483.167 3098.038 203.01837 182.11111
## Oct 2016 3937.059 3250.514 152.47585 534.06944
## Nov 2016 3273.471 3312.104 61.58986 -100.22222
## Dec 2016 2350.017 3290.880 -21.22389 -919.63889
## Jan 2017 2416.407 3225.900 -64.97949 -744.51389
## Feb 2017 2849.278 3242.658 16.75814 -410.13889
## Mar 2017 3459.721 3205.467 -37.19093 291.44444
## Apr 2017 3361.930 3077.143 -128.32442 413.11111
## May 2017 3284.420 2881.538 -195.60472 598.48611
## Jun 2017 2810.547 2793.841 -87.69709 104.40278
## Jul 2017 3093.185 2851.023 57.18162 184.98057
## Aug 2017 3288.938 2998.999 147.97642 141.96220
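The columns above have the layout of the fitted values of an additive Holt-Winters model, where the one-step-ahead fit is the sum of the three components; for example, for Jul 2016, 2725.259 + 38.14394 - 35.38889 gives 2728.014. The sketch below checks this identity with `stats::HoltWinters()` on R's built-in `co2` series, which stands in for the booking counts; the choice of data set and the forecast horizon are assumptions for illustration.

```r
# Additive Holt-Winters fit on a built-in monthly series (stand-in data)
fit <- HoltWinters(co2)

# Fitted components: columns xhat, level, trend, season
f <- fit$fitted
print(head(f))

# In the additive model the fit is the sum of the components
recomposed <- as.numeric(f[, "level"]) +
  as.numeric(f[, "trend"]) +
  as.numeric(f[, "season"])

# Forecast 8 months ahead with 95% prediction intervals
fc <- predict(fit, n.ahead = 8, prediction.interval = TRUE, level = 0.95)
print(fc)
```

The fitted values start one full season after the first observation, because the first periods are used to initialise the level, trend, and seasonal terms. This is why the table above begins in Jul 2016 although the data start in Jul 2015.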
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 7.270737 233.2177 218.2107 0.1472528 6.987008 0.4531895 0.3437704
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 2017 3248.897 2938.885 3558.910 2774.775 3723.020
## Oct 2017 3609.514 3222.398 3996.629 3017.471 4201.556
## Nov 2017 3064.943 2544.560 3585.327 2269.086 3860.801
## Dec 2017 2387.129 1690.133 3084.125 1321.165 3453.092
## Jan 2018 2848.577 1942.073 3755.080 1462.199 4234.954
## Feb 2018 3031.974 1889.385 4174.562 1284.535 4779.412
## Mar 2018 3747.504 2346.090 5148.918 1604.226 5890.782
## Apr 2018 3985.315 2304.844 5665.785 1415.257 6555.372